Focused Crawling

Authors

  • Krishnan Suresh
  • Sivaramakrishnan Kaveri
Abstract

Focused crawling is an efficient mechanism for discovering resources of interest on the web. Link structure is an important property of the web that defines its content. This thesis describes FOCUS, a novel focused crawler that relies primarily on the link structure of the web in its crawling strategy. It uses currently available search engine APIs, provided by Google, to construct a layered web graph. This layered model of the web is used to learn which links lead to topic pages. The learning task is formulated as an ordinal regression problem whose solution gives the link distance from any given page to the topic pages. This directly produces an ordering among the links to be crawled. The ordinal nature of the problem removes the need for a negative class, which is a problem with existing crawlers. The large scale of the web poses scaling problems for current ordinal regression solvers. To overcome this, a novel large-scale, clustering-based ordinal regression solver using Second Order Cone Programming (SOCP) is proposed. The proposed approach is implemented on top of the Nutch open source crawler, which uses the well-known MapReduce distributed model to make the crawler scalable. Experimental results on different datasets show that the proposed method is competitive.
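The layered graph and the resulting crawl order can be illustrated with a minimal sketch. It is not the thesis implementation: the `backlinks` dictionary is a hypothetical stand-in for backlink queries to a search engine API, `assign_layers` and `Frontier` are illustrative names, and the predicted distance passed to `push` is assumed to come from some trained ordinal regression model (the SOCP solver itself is not shown).

```python
import heapq
from collections import deque


def assign_layers(seeds, backlinks, max_layer):
    """Label pages by link distance to the topic pages.

    Layer 0 holds the seed (topic) pages; layer k holds pages that reach
    a topic page in k clicks. `backlinks` maps a URL to the pages that
    link to it (an assumed, precomputed stand-in for API queries).
    """
    layer = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if layer[url] == max_layer:
            continue  # do not expand beyond the outermost layer
        for parent in backlinks.get(url, ()):
            if parent not in layer:  # first visit gives the shortest distance
                layer[parent] = layer[url] + 1
                queue.append(parent)
    return layer


class Frontier:
    """Crawl frontier ordered by predicted link distance (smallest first)."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._count = 0  # tie-breaker that keeps insertion order stable

    def push(self, url, predicted_distance):
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        heapq.heappush(self._heap, (predicted_distance, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

The layer labels serve as ordinal training targets, so every page has a rank and no separate negative class is required. At crawl time the frontier simply pops the link with the smallest predicted distance first.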

Similar resources

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download only the domain-specific web pages. An unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...

Focused Crawling Techniques

The need for increasingly specific replies to web search queries has prompted researchers to work on focused web crawling techniques for web spiders. A variety of lexical and link-based approaches to focused web crawling are introduced in the paper, highlighting important aspects of each. General Terms: Focused Web Crawling, Algorithms, Crawling Techniques.

Profile-Based Focused Crawling for Social Media-Sharing Websites

We present a novel profile-based focused crawling system for dealing with the increasingly popular social media-sharing websites. In this system, we treat the user profiles as ranking criteria for guiding the crawling process. Furthermore, we divide a user’s profile into two parts, an internal part, which comes from the user’s own contribution, and an external part, which comes from the user’s ...

Focused Crawling Using Context Graphs

Maintaining currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the increasing size and dynamic content of the web. Focused crawlers aim to search only the subset of the web related to a specific category, and offer a potential solution to the currency problem. The major problem in focused crawling is performing appropriate credit assignment to differe...

On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis

Focused crawling is an important technique for topical resource discovery on the Web. The key issue in focused crawling is to prioritize uncrawled uniform resource locators (URLs) in the frontier to focus the crawling on relevant pages. Traditional focused crawlers mainly rely on content analysis. Link-based techniques are not effectively exploited despite their usefulness. In this paper, we pr...

Automatic Publication Data

In many universities it would be useful to have a database of publications that reflects the research results of the academic staff. Such a database can be built by automatically retrieving publication information from faculty members' homepages. In this project, we deploy focused crawling to build such a system. We also propose a new focused crawling heuristic based on URL classification. We compare...

Publication date: 2007